The Hydra chip multiprocessor (CMP) integrates four MIPS-based
processors and their primary caches on a single chip together with a
shared secondary cache. A standard CMP offers implementation and
performance advantages compared to wide-issue superscalar designs.
However, it must be programmed with a more complicated parallel
programming model to obtain maximum performance. To simplify parallel
programming, the Hydra CMP supports thread-level speculation and
memory renaming, a paradigm that allows performance similar to a
uniprocessor of comparable die area on integer programs. This article
motivates the design of a CMP, describes the architecture of the Hydra
design with a focus on its speculative thread support, and describes
our prototype implementation.

Why build a CMP?

As Moore’s law allows increasing numbers of smaller and faster
transistors to be integrated on a single chip, new processors are
being designed to use these transistors effectively to improve
performance. Today, most microprocessor designers use the increased
transistor budgets to build larger and more complex uniprocessors.
However, several problems are beginning to make this approach to
microprocessor design difficult to continue. To address these
problems, we have proposed that future processor design methodology
shift from simply making progressively larger uniprocessors to
implementing more than one processor on each chip. The following
discusses the key reasons why single-chip multiprocessors are a good
idea.


Parallelism

Designers primarily use additional transistors on chips to extract
more parallelism from programs to perform more work per clock cycle.
While some transistors are used to build wider or more specialized
data path logic (to switch from 32 to 64 bits or add special
multimedia instructions, for example), most are used to build
superscalar processors. These processors can extract greater amounts
of instruction-level parallelism, or ILP, by finding nondependent
instructions that occur near each other in the original program code.

Unfortunately, there is only a finite amount of ILP present in any
particular sequence of instructions that the processor executes
because instructions from the same sequence are typically highly
interdependent. As a result, processors that use this technique are
seeing diminishing returns as they attempt to execute more
instructions per clock cycle, even as the logic required to process
multiple instructions per clock cycle increases quadratically. A CMP
avoids this limitation by primarily using a completely different type
of parallelism: thread-level parallelism, or TLP. We obtain TLP by
running completely separate sequences of instructions on each of the
separate processors simultaneously. Of course, a CMP may also exploit
small amounts of ILP within each of its individual processors, since
ILP and TLP are orthogonal to each other.

Wire delay

As CMOS gates become faster and chips become physically larger, the
delay caused by interconnects between gates is becoming more
significant. Due to rapid process technology improvement, within the
next few years wires will only be able to transmit signals over a
small portion of large processor chips during each clock cycle.
However, a CMP can be designed so that each of its small processors
takes up a relatively small area on a large processor chip, minimizing
the length of its wires and simplifying the design of critical paths.
Only the more infrequently used, and therefore less critical, wires
connecting the processors need to be long.

Figure 1. An overview of the Hydra CMP.

0272-1732/00/$10.00 © 2000 IEEE

Design time

Processors are already difficult to design. Larger numbers of
transistors, increasingly complex methods of extracting ILP, and wire
delay considerations will only make this worse. A CMP can help reduce
design time, however, because it allows a single, proven processor
design to be replicated multiple times over a die. Each processor core
on a CMP can be much smaller than a competitive uniprocessor,
minimizing the core design time. Also, a core design can be used over
more chip generations simply by scaling the number of cores present on
a chip. Only the processor interconnection logic is not entirely
replicated on a CMP.

Why aren’t CMPs used now?

Since a CMP addresses all of these potential problems in a
straightforward, scalable manner, why aren’t CMPs already common? One
reason is that integration densities are just reaching levels where
these problems are becoming significant enough to consider a paradigm
shift in processor design. The primary reason, however, is that it is
very difficult to convert today’s important uniprocessor programs into
multiprocessor ones.

Conventional multiprocessor programming techniques typically require
careful data layout in memory to avoid conflicts between processors,
minimization of data communication between processors, and explicit
synchronization at any point in a program where processors may
actively share data. A CMP is much less sensitive to poor data layout
and poor communication management, since the interprocessor
communication latencies are lower and bandwidths are higher. However,
sequential programs must still be explicitly broken into threads and
synchronized properly. Parallelizing compilers have been only
partially successful at automatically handling these tasks for
programmers. As a result, acceptance of multiprocessors has been
slowed because only a limited number of programmers have mastered
these techniques.

A base Hydra design

To understand the implementation and performance advantages of a CMP
design, we are developing the Hydra CMP. Hydra is a CMP built using
four MIPS-based cores as its individual processors (see Figure 1).
Each core has its own pair of primary instruction and data caches,
while all processors share a single, large on-chip secondary cache.
The processors support normal loads and stores plus the MIPS load
locked (LL) and store conditional (SC) instructions for implementing
synchronization primitives.

Connecting the processors and the secondary cache together are the
read and write buses, along with a small number of address and control
buses. In the chip implementation, almost all buses are virtual buses.
While they logically act like buses, the physical wires are divided
into multiple segments using repeaters and pipeline buffers, where
necessary, to avoid slowing down the core clock frequencies.

The read bus acts as a general-purpose bus for moving data between the
processors, secondary cache, and external interface to off-chip
memory. It is wide enough to handle an entire cache line in one clock
cycle. This is an advantage possible with an on-chip bus that all but
the most expensive multichip systems cannot match due to the large
number of pins that would be required on all chip packages.

The narrower write bus is devoted to writing all writes made by the
four cores directly to the secondary cache. This allows the permanent
machine state to be maintained in the secondary cache. The bus is
pipelined to allow single-cycle occupancy by each write, preventing it
from becoming a system bottleneck. The write bus also permits Hydra to
use a simple, invalidation-only coherence protocol to maintain
coherent primary caches. Writes broadcast over the bus invalidate
copies of the same line in primary caches of the other processors. No
data is ever permanently lost due to these invalidations because the
permanent machine state is always maintained in the secondary cache.

The write bus also enforces memory consistency in Hydra. Since all
writes must pass over the bus to become visible to the other
processors, the order in which they pass is globally acknowledged to
be the order in which they update shared memory.
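
The write-bus behavior described above can be pictured with a small software model. This is only an illustrative sketch, not the Hydra hardware: the class name `HydraMemoryModel`, its methods, and the use of single addresses in place of whole cache lines are all assumptions made for brevity. It shows write-through primary caches, invalidation of remote copies on every write, and a single global write order.

```python
class HydraMemoryModel:
    """Toy model (hypothetical): write-through L1s, shared L2, invalidate-on-write."""

    def __init__(self, num_cpus=4):
        self.l2 = {}                                 # permanent machine state
        self.l1 = [dict() for _ in range(num_cpus)]  # one primary cache per CPU
        self.write_order = []                        # order in which writes crossed the bus

    def read(self, cpu, addr):
        cache = self.l1[cpu]
        if addr not in cache:              # L1 miss: refill from the shared L2
            cache[addr] = self.l2.get(addr, 0)
        return cache[addr]

    def write(self, cpu, addr, value):
        # Write through, no allocate on write: update the local copy only if present.
        if addr in self.l1[cpu]:
            self.l1[cpu][addr] = value
        self.l2[addr] = value              # every write reaches the secondary cache
        self.write_order.append((cpu, addr, value))
        for other, cache in enumerate(self.l1):
            if other != cpu:
                cache.pop(addr, None)      # invalidate copies held by other CPUs
```

Because every write lands in `l2`, dropping an entry from any `l1` dictionary never loses data, which is the property the invalidation-only protocol relies on.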

We were primarily concerned with minimizing two measurements of the
design: the complexity of high-speed logic and the latency of
interprocessor communication. Since decreasing one tends to increase
the other, a CMP design must strive to find a reasonable balance. Any
architecture that allows interprocessor communication between
registers or the primary caches of different processors will add
complex logic and long wires to paths that are critical to the cycle
time of the individual processor cores. Of course, this complexity
results in excellent interprocessor communication latencies, usually
just one to three cycles. Past results have shown that sharing this
closely is helpful, but not if it extends the access time to the
registers and/or primary caches. Consequently, we chose not to connect
our processors this tightly. On the other hand, these results also
indicated that we would not want to incur the delay of an off-chip
reference, which can often take 100 or more cycles in modern
processors, during each interprocessor communication.

Because it is now possible to integrate reasonable-size secondary
caches on processor dies and since these caches are typically not
tightly connected to the core logic, we chose to use that as the point
of communication. In the Hydra architecture, this results in
interprocessor communication latencies of 10 to 20 cycles, which are
fast enough to minimize the performance impact from communication
delays. After considering the bandwidth required by four single-issue
MIPS processors sharing a secondary cache, we concluded that a simple
bus architecture would be sufficient to handle the bandwidth
requirements. This is acceptable for a four- to eight-processor Hydra
implementation. However, designs with more cores or faster individual
processors may need to use either more buses, crossbar
interconnections, or a hierarchy of connections.

Parallel software performance

We have performed extensive simulation to evaluate the potential
performance of the Hydra design. Using a model with the memory
hierarchy summarized in Table 1, we compared the performance of a
single Hydra processor to the performance of all four processors
working together. We used the 10 benchmarks summarized in Table 2 to
generate the results presented in Figure 2.

The results indicate that for multiprogrammed workloads and highly
parallel benchmarks such as large matrix-based or multimedia
applications, we can obtain nearly linear speedup by using multiple
Hydra processors working together. These speedups typically will be
much greater than those that can be obtained simply by making a single
large ILP processor occupying the same area as the four Hydra
processors. In addition, multiprogrammed workloads are inherently
parallel, while today’s compilers can automatically parallelize most
dense matrix Fortran applications. However, there is still a large
category of less parallel applications, primarily integer ones that
are not easily parallelized (eqntott, m88ksim, and apsi). The speedups
we obtained with Hydra on these applications would be difficult or
impossible to achieve on a conventional multiprocessor, with the long
interprocessor communication latencies required by a multichip design.
Even on Hydra the speed improvement obtained after weeks of
hand-parallelization is just comparable to that obtainable with a
similar-size ILP processor with no programmer effort. More troubling,
compress represents a large group of applications that cannot be
parallelized at all using conventional techniques.

Table 1. The Hydra system configuration used for our simulations.

Characteristic  L1 cache                     L2 cache                        Main memory
Configuration   Separate I and D SRAM cache  Shared, on-chip SRAM cache      Off-chip DRAM
                pairs for each CPU
Capacity        16 Kbytes each               2 Mbytes                        128 Mbytes
Bus width       32-bit connection to CPU     256-bit read bus +              64-bit bus at half CPU speed
                                             32-bit write bus
Access time     1 CPU cycle                  5 CPU cycles                    At least 50 cycles
Associativity   4 way                        4 way                           N/A
Line size       32 bytes                     64 bytes                        4-Kbyte pages
Write policy    Write through, no allocate   Write back, allocate on writes  Write back (virtual memory)
                on write
Inclusion       N/A                          Inclusion enforced by L2        Includes all cached data
                                             on L1 caches

Table 2. A summary of the conventionally parallelized applications we
used to make performance measurements with Hydra.

Application  Source            Description                                How parallelized

Purpose: General uniprocessor applications
compress     SPEC95            Entropy-encoding file compression          Not possible using conventional means
eqntott      SPEC92            Logic minimization                         On inner bit vector comparison loop
m88ksim      SPEC95            CPU simulation of Motorola 88000           Simulated CPU is pipelined across processors
apsi         SPEC95            Weather and air pollution modeling         Automatically by the SUIF compiler

Purpose: Matrix and multimedia applications
MPEG2        Mediabench suite  Decompression of an MPEG-2 bitstream       “Slices” in the input bitstream distributed among processors
applu        SPEC95            Solver for partial differential equations  Automatically by the SUIF compiler
swim         SPEC95            Grid-based finite difference modeling      Automatically by the SUIF compiler
tomcatv      SPEC95            Mesh generation                            Automatically by the SUIF compiler

Purpose: Multiprogrammed workloads
OLTP         TPC-B             Database transaction processing            Different transactions execute in parallel
pmake        Unix command      Parallel compilation of several source     Compilations of different files execute in parallel
                               files

Thread-level speculation: A helpful extension

Applications such as database and Web servers perform well on
conventional multiprocessors, and therefore these applications will
provide the initial motivation to adopt CMP architectures, at least in
the server domain. However, general uniprocessor applications must
also work well on CMP architectures before they can ever replace
uniprocessors in most computers. Hence, there needs to be a simple,
effective way to parallelize even these applications. Hardware support
for thread-level speculation is a promising technology that we chose
to add to the basic Hydra design, because it eliminates the need for
programmers to explicitly divide their original program into
independent threads.

Thread-level speculation takes the sequence of instructions run during
an existing uniprocessor program and arbitrarily breaks it into a
sequenced group of threads that may be run in parallel on a
multiprocessor. To ensure that each program executes the same way that
it did originally, hardware must track all interthread dependencies.
When a “later” thread in the sequence causes a true dependence
violation by reading data too early, the hardware must ensure that the
misspeculated thread, or at least the portion of it following the bad
read, re-executes with the proper data. This is a considerably
different mechanism from the one used to enforce dependencies on
conventional multiprocessors. There, synchronization is inserted so
that threads reading data from a different thread will stall until the
correct value has been written. This process is complex because it is
necessary to determine all possible true dependencies in a program
before synchronization points may be inserted.

Speculation allows parallelization of a program into threads even
without prior knowledge of where true dependencies between threads may
occur. All threads simply run in parallel until a true dependency is
detected while the program is executing. This greatly simplifies the
parallelization of programs because it eliminates the need for human
programmers or compilers to statically place synchronization points
into programs by hand or at compilation. All places where
synchronization would have been required are simply found dynamically
when true dependencies actually occur. As a result of this advantage,
uniprocessor programs may be parallelized in a speculative system.
While conventional parallel programmers must constantly worry about
maintaining program correctness, programmers parallelizing code for a
speculative system can focus solely on achieving maximum performance.
The speculative hardware will ensure that the parallel code always
performs the same computation as the original sequential program.

Figure 2. Speedup of conventionally parallelized applications running
on Hydra compared with the original uniprocessor code running on one
of Hydra’s four processors.

Since parallelization by speculation dynamically finds parallelism
among program threads at runtime, it does not need to be as
conservative as conventional parallel code. In many programs there are
many potential dependencies that may result in a true dependency, but
where dependencies seldom if ever actually occur during the execution
of the program. A speculative system may attempt to run the threads in
parallel anyway, and only back up the later thread if a dependency
actually occurs.

Figure 3. Five basic requirements for special coherency hardware: a
sequential program that can be broken into two threads (a); forwarding
and violations caused by interaction of reads and writes (b);
speculative memory state eliminated following violations (c);
reordering of writes following thread commits (d); and memory renaming
among threads (e).

On the other hand, a system dependent on synchronization must always
synchronize at any point where a dependency might occur, based on a
static analysis of the program, whether or not the dependency actually
ever occurs at runtime. Routines that modify data objects through
pointers in C programs are a frequent source of this problem within
many integer applications. In these programs, a compiler (and
sometimes even a programmer performing hand parallelization) will
typically have to assume that any later pointer reads may be dependent
on the latest write of data using a pointer, even if that is rarely or
never the case. As a result, a significant amount of thread-level
parallelism can be hidden by the way the uniprocessor code is written,
and therefore wasted as a compiler conservatively parallelizes a
program.
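
To make the gap between potential and actual dependencies concrete, the following sketch compares the two for a loop whose iterations read and write through pointers. It is an illustration under stated assumptions rather than anything taken from the Hydra toolchain: the function names are hypothetical, each iteration is assumed to read one address and then write one address, and the "conservative" count simply assumes that every iteration may depend on its predecessor.

```python
def true_dependences(write_addrs, read_addrs):
    """Return (writer, reader) iteration pairs with a real read-after-write dependence.

    Iteration k is assumed to read read_addrs[k] and then write write_addrs[k].
    """
    last_writer = {}   # address -> most recent iteration that wrote it
    pairs = []
    for k, (r, w) in enumerate(zip(read_addrs, write_addrs)):
        if r in last_writer:
            pairs.append((last_writer[r], k))
        last_writer[w] = k
    return pairs


def conservative_sync_points(num_iterations):
    """Static view: without pointer disambiguation, every iteration after the
    first must be assumed to depend on the one before it."""
    return max(num_iterations - 1, 0)


# Eight iterations; only iteration 6 actually reads something an earlier one wrote.
writes = [10, 11, 12, 13, 14, 10, 16, 17]
reads = [0, 1, 2, 3, 4, 5, 10, 7]
actual = true_dependences(writes, reads)
assumed = conservative_sync_points(len(writes))
```

In this made-up input a synchronizing system would pay for seven synchronization points, while a speculative system would see a single true dependency at runtime.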

Note that speculation and synchronization are not mutually exclusive.
A program with speculative threads can still perform synchronization
around uses of dependent data, but this synchronization is optional.
As a result, a programmer or feedback-driven compiler can still add
synchronization into a speculatively parallelized program if that
helps the program execute faster. In our experiments, we found a few
cases where synchronization protecting one or two key dependencies in
a speculatively parallelized program produced speedup by dramatically
reducing the number of violations that occurred. Too much
synchronization, however, tended to make the speculative
parallelization too conservative and was a detriment to performance.

To support speculation, we need special coherency hardware to monitor
data shared by the threads. This hardware must fulfill five basic
requirements, illustrated in Figure 3. The figure shows some typical
data access patterns in two threads, i and i + 1. Figure 3a shows how
data flows through these accesses when the threads are run
sequentially on a normal uniprocessor. Figures 3b-e show how the
hardware must handle the requirements.

1. Forward data between parallel threads. While good thread selection
   can minimize the data shared among threads, typically a significant
   amount of sharing is required, simply because the threads are
   normally generated from a program in which minimizing data sharing
   was not a design goal. As a result, a speculative system must be
   able to forward shared data quickly and efficiently from an earlier
   thread running on one processor to a later thread running on
   another. Figure 3b depicts this.

2. Detect when reads occur too early (RAW hazards). The speculative
   hardware must provide a mechanism for tracking reads and writes to
   the shared data memory. If a data value is read by a later thread
   and subsequently written by an earlier thread, the hardware must
   notice that the read retrieved incorrect data since a true
   dependence violation has occurred. Violation detection allows the
   system to determine when threads are not actually parallel, so that
   the violating thread can be re-executed with the correct data
   values. See Figure 3b.

3. Safely discard speculative state after violations. As depicted in
   Figure 3c, speculative memory must have a mechanism allowing it to
   be reset after a violation. All speculative changes to the machine
   state must be discarded after a violation, while no permanent
   machine state may be lost in the process.

4. Retire speculative writes in the correct order (WAW hazards). Once
   speculative threads have completed successfully, their state must
   be added to the permanent state of the machine in the correct
   program order, considering the original sequencing of the threads.
   This may require the hardware to delay writes from later threads
   that actually occur before writes from earlier threads in the
   sequence, as Figure 3d illustrates.

5. Provide memory renaming (WAR hazards). Figure 3e depicts an earlier
   thread reading an address after a later processor has already
   written it. The speculative hardware must ensure that the older
   thread cannot “see” any changes made by later threads, as these
   would not have occurred yet in the original sequential program.
   This process is complicated by the fact that each processor will
   eventually be running newly generated threads (i + 2 in the figure)
   that will need to “see” the changes.
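
The five requirements above can be exercised with a small software model. What follows is a minimal sketch under loud assumptions, not a description of Hydra’s hardware: the class `SpeculativeMemory` and all of its method names are hypothetical, addresses are tracked individually rather than as cache lines, threads are identified by their position in the original program order, and a squash is assumed to restart the violated thread together with every later thread.

```python
class SpeculativeMemory:
    """Toy model (hypothetical) of the five speculative memory requirements.

    Thread ids are non-negative sequence numbers: lower = earlier thread.
    """

    def __init__(self):
        self.permanent = {}    # committed, non-speculative state
        self.buffers = {}      # tid -> {addr: value} speculative writes
        self.read_sets = {}    # tid -> {addr: tid that supplied the value, -1 = permanent}
        self.violated = set()  # threads that must re-execute

    def start(self, tid):
        self.buffers[tid] = {}
        self.read_sets[tid] = {}
        self.violated.discard(tid)

    def read(self, tid, addr):
        # Requirements 1 and 5: use the newest value from this thread or an
        # EARLIER one; writes made by later threads remain invisible.
        for t in sorted((t for t in self.buffers if t <= tid), reverse=True):
            if addr in self.buffers[t]:
                if t != tid:
                    self.read_sets[tid].setdefault(addr, t)
                return self.buffers[t][addr]
        self.read_sets[tid].setdefault(addr, -1)
        return self.permanent.get(addr, 0)

    def write(self, tid, addr, value):
        self.buffers[tid][addr] = value
        # Requirement 2: a later thread that already read this address from
        # this thread, an earlier one, or permanent state now holds stale data.
        for t, reads in self.read_sets.items():
            if t > tid and reads.get(addr, tid + 1) <= tid:
                self.violated.add(t)

    def squash(self, tid):
        # Requirement 3: discard speculative state only (assumption: the
        # violated thread and all later threads restart together).
        for t in [t for t in self.buffers if t >= tid]:
            self.start(t)

    def commit(self, tid):
        # Requirement 4: writes retire strictly in original program order.
        if tid != min(self.buffers):
            raise RuntimeError("threads must commit in program order")
        if tid in self.violated:
            raise RuntimeError("violated thread must re-execute before commit")
        self.permanent.update(self.buffers.pop(tid))
        del self.read_sets[tid]
```

A short session mirrors Figure 3: start threads 0 and 1, let thread 0 write a value that thread 1 then reads (forwarding), let thread 0 write an address thread 1 has already read (violation), call `squash(1)`, and finally commit thread 0 before thread 1.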

In some proposed speculative hardware, the logic enforcing these
requirements monitors both the processor registers and the memory
hierarchy. However, in Hydra we chose to have hardware only enforce
speculative coherence on the memory system, while software handles
register-level coherence.

In addition to speculative memory support, any system supporting
speculative threads must have a way to break up an existing program
into threads and a mechanism for controlling and sequencing those
threads across multiple processors at runtime. This generally consists
of a combination of hardware and software that finds good places in a
program to create new, speculative threads. The system then sends
these threads off to be processed by the other processors in the CMP.

While in theory a program may be speculatively divided into threads in
a completely arbitrary manner, in practice one is limited. Initial
program counter positions (and, for Hydra, register states) must be
generated when threads are started. As a result, we investigated two
ways to divide a program into threads: loops and subroutine calls.
With loops, several iterations of a loop body can be started
speculatively on multiple processors. As long as there are only a few
straightforward loop-carried dependencies, the execution of loop
bodies on different processors can be overlapped to achieve speedup.
Using subroutines, a new thread can start to run the code following a
subroutine call’s return, while the original thread actually executes
the subroutine itself (or vice-versa). As long as the return value
from the subroutine is predictable (typically, when there is no return
value) and any side effects of the subroutine are not used
immediately, the two threads can run in parallel. In general,
achieving speedup with this technique is more challenging because
thread sequencing and load balancing among the processors is more
complicated with subroutines than loops.

Figure 4. An overview of Hydra with speculative support.

Once threads have been created, the speculation runtime system must
select the four least speculative threads available and allocate them
to the four processors in Hydra. Note that the least speculative, or
head, thread is special. This thread is actually not speculative at
all, since all older threads that could have caused it to violate have
already completed. As a result, it can handle events that cannot
normally be handled speculatively (such as operating system calls and
exceptions). Since all threads eventually become the head thread,
simply stalling a thread until it becomes the head will allow the
thread to process these events during speculation.
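
The selection rule and the special role of the head thread can be sketched in a few lines. This is a simplified illustration, assuming threads are represented only by their sequence numbers in the original program order; the function names `schedule` and `may_proceed` are hypothetical and do not correspond to Hydra’s runtime handlers.

```python
def schedule(pending_threads, num_cpus=4):
    """Pick the least speculative threads and identify the head thread.

    pending_threads: iterable of sequence numbers (lower = earlier in the
    original program order). Returns (head, running), where running is
    ordered from least to most speculative.
    """
    running = sorted(pending_threads)[:num_cpus]
    head = running[0] if running else None
    return head, running


def may_proceed(thread, head, needs_nonspeculative_event):
    """A thread that needs an OS call or exception stalls until it is the head."""
    return thread == head or not needs_nonspeculative_event


head, running = schedule([7, 3, 5, 4, 9, 6])
```

With the example input, threads 3 through 6 occupy the four processors and thread 3 is the head; thread 5 would have to wait before making an operating system call, while thread 3 could proceed immediately.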

Implementing speculation in Hydra

Speculation is an effective method for breaking an existing
uniprocessor program into multiple threads. However, the threads
created automatically by speculation often require fast interprocessor
communication of large amounts of data. After all, minimizing
communication between arbitrarily created threads is not a design
consideration in most uniprocessor code. A CMP like Hydra is necessary
to provide low-enough interprocessor communication latencies and
high-enough interprocessor bandwidth to allow the design of a
practical speculative thread mechanism.

Among CMP designs, Hydra is a particularly good target for speculation
because it has write-through primary caches that allow all processor
cores to snoop on all writes performed. This is very helpful in the
design of violation-detection logic. Figure 4 updates Figure 1, noting
the necessary additions. The additional hardware is enabled or
bypassed selectively by each memory reference, depending upon whether
a speculative thread generates the reference.

Most of the additional hardware is contained in two major blocks. The
first is a set of additional tag bits added to each primary cache line
to track whether any data in the line has been speculatively read or
written. The second is a set of write buffers that hold speculative
writes until they can be safely committed into the secondary cache,
which is guaranteed to hold only nonspeculative data. One buffer is
allocated to each speculative thread currently running on a Hydra
processor, so the writes from different threads are always kept
separate. Only when speculative threads complete successfully are the
contents of these buffers actually written into the secondary cache
and made permanent. As shown in Figure 4, one or more extra buffers
may be included to allow buffers to be drained into the secondary
cache in parallel with speculative execution on all of the CPUs. We
have previously published more details about the additional primary
cache bits and secondary cache buffers.

To control the thread sequencing in our system, we also added a small
amount of hardware to each core using the MIPS coprocessor interface.
These simple “speculation coprocessors” consist of several control
registers, a set of duplicate secondary cache buffer tags, a state
machine to track the current thread sequencing among the processors,
and interrupt logic that can start software handlers when necessary to
control thread sequencing. These software handlers are responsible for
thread control and sequencing. Prior publications provide complete
details of how these handlers work to sequence speculative threads in
the Hydra hardware.
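
One way to picture how per-thread write buffers can supply data on a read is a byte-wise merge over the committed copy of a line. The sketch below is an assumption-laden illustration, not the actual buffer logic: the function `read_line`, its arguments, and the dictionary layout of the buffers are all hypothetical. It assumes that bytes written by the reading thread or by earlier threads override the secondary cache copy, that the most recent such writer wins, and that writes from later threads are ignored.

```python
def read_line(l2_line, write_buffers, reader_tid):
    """Merge one cache line for the thread `reader_tid` (hypothetical sketch).

    l2_line: bytes holding the committed copy of the line.
    write_buffers: {thread_id: {byte_offset: byte_value}} speculative writes
        to this line, keyed by thread sequence number (lower = earlier).
    """
    line = bytearray(l2_line)
    # Visit eligible buffers from earliest to latest so that the latest
    # eligible writer of each byte is the one whose value survives.
    for tid in sorted(t for t in write_buffers if t <= reader_tid):
        for offset, value in write_buffers[tid].items():
            line[offset] = value
    return bytes(line)


committed = bytes(8)
buffers = {1: {0: 0xAA, 1: 0xAB}, 2: {1: 0xBB}, 3: {2: 0xCC}}
merged = read_line(committed, buffers, reader_tid=2)
```

In the example, thread 2 sees byte 0 from thread 1, its own byte 1, and the committed value for byte 2, because thread 3 is later in program order.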

Together with the architecture of Hydra’s 
existing write bus, the additional hardware 
allows the memory system to handle the five 
memory system requirements outlined previ- 
oudy in the following ways: 

Forward data between parallel threads. 
When a speculative thread writes data 
over the write bus, all more-speculative 
threads that may need the data have their 
current copy of that cache line invalidat¬ 
ed. This is similar to the way the system 
works during non^eculative operation. 
If any of the threads subsequently need 
the new speculative data forwarded to 
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Modified bit of the newly read line is set 
If there is a hit in any of these buffers 
(or, more optimally, only /-1 and /) 
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Figure 5. How secondary cache speculaliyi&ffers are read. 1) A G^U reads from its LI 
cache. The L1 read bit of any hit lines aresli 2) The L2 and write buffers are checked in 
parallel In the event of an L1 miss* Priorty enoK^son each byte (indicated by priorities 1-4 
here) pull in the newest bytes written toallne:The appropriate word is delivered to the CPU 
and LI with the LI modified and preinvaikli^frte set approfiMlately* 


them, they will miss in their primary 
cache and access the secondary cache. At 
this point, as is outlined in Figure 5, the 
speculative data contained in the write 
buffers of the current or older threads 
replaces data returned from the sec¬ 
ondary cache on a byte-by-byte basis just 
before the composite line is returned to 
the processor and primary cache. Over¬ 
all, this is a relatively simple extension to 
the coherence mechanism used in the 
baseline Hydra design. 

2. Detect when reads occur too early. Prima¬ 
ry cache bits are set to mark any reads 
that may cause violations. Subsequently, 
if a write to that address from an earlier 
thread invalidates the address, a violation 
is detected, and the thread is restarted. 

3. Safely discard speculative state after
violations. Since all permanent machine state
in Hydra is always maintained within the
secondary cache, anything in the primary
caches may be invalidated at any time
without risking a loss of permanent state.
As a result, any lines in the primary cache
containing speculative data (marked with
a special modified bit) may simply be
invalidated all at once to clear any
speculative state from a primary cache. In
parallel with this operation, the secondary
cache buffer for the thread may be emptied
to discard any speculative data written by
the thread without damaging data written by
other threads or the permanent state of the
machine in the secondary cache.

Other CMP and TLS efforts

[...] of the processors to allow direct register-to-register communication. Along with hard[...]
[...] to be exploited at the expense of more complex processor cores. The de[...]
[...] was a unified primary cache, or address resolution buffer (ARB). [...] has most of
the complexity of Hydra's secondary cache level, making it difficult to implement. Later,
they proposed the speculative versioning cache. The SVC uses write-back primary caches to
buffer speculative writes in the primary caches, using a sophisticated coherence scheme.
[...] memory references that require forwarding to another speculative thread. Their
simplified SVC must drain its speculative contents as each thread completes, unfortunately
resulting in heavy bursts of bus activity.

• MIT M-machine: This one-chip CMP design has three processors that share a primary
cache and can communicate register-to-register through a crossbar. Each processor
can also switch dynamically among several threads. As a result, the hardware
connecting processors together is quite complex and slow. However, programs executed
on the M-machine can be parallelized using very fine-grain mechanisms that are
impossible on an architecture that shares outside of the processor cores, like Hydra.
Performance results show that on typical applications extremely fine-grained
parallelization is often not as effective as parallelism at the levels that Hydra can
exploit. The overhead incurred by frequent synchronizations reduces the effectiveness.

Recently, Sun and IBM announced plans to make CMPs. Sun's plans offer limited TLS support.

• IBM Power4: This is the first commercially proposed CMP targeted at servers and
other systems that already make use of conventional multiprocessors. It resembles
Hydra but does not have TLS support (an unnecessary feature for most types of servers)
and has two large processors per chip.

• [...] It also supports a subroutine-based TLS. MAJC [...] interrupt hardware
[...] to global objects that can be shared among subroutines are clearly defined in Java
bytecode binaries. However, even with the shared primary cache [...]
[...] to scale to more than two processors due to the overhead of the software handlers.

4. Retire speculative writes in the correct order.
Separate secondary cache buffers are
maintained for each thread. As long as
these are drained into the secondary
cache in the original program sequence
of the threads, they will reorder speculative
memory references correctly. The
thread-sequencing system in Hydra also
sequences the buffer draining, so the
buffers can meet this requirement.

5. Provide memory renaming. Each processor
can only read data written by itself or
earlier threads when reading its own primary
cache or the secondary cache buffers.
Writes from later threads don't cause
immediate invalidations in the primary
cache, since these writes should not be
visible to earlier threads. This allows each
primary cache to have its own local copy of
a particular line. However, these "ignored"
invalidations are recorded using an additional
pre-invalidate primary cache bit
associated with each line. This is because
they must be processed before a different
speculative or nonspeculative thread executes
on this processor. If a thread has to
load a cache line from the secondary
cache, the line it recovers only contains
data that it should actually be able to "see"
from its own and earlier buffers, as Figure
5 indicates. Finally, if future threads have
written to a particular line in the primary
cache, the pre-invalidate bit for that line
is set. When the current thread completes,
these bits allow the processor to quickly
simulate the effect of all stored invalidations
caused by all writes from later processors
all at once, before a new thread begins
execution on this processor.
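
The buffering rules in the list above can be illustrated with a small software model. This is only a sketch under stated assumptions, not a description of the Hydra hardware: the class `SpeculativeMemory` and its method names are hypothetical, threads are simply numbered in program order, per-address sets stand in for the primary cache read bits, and a restart conservatively squashes all later threads as well.

```python
class SpeculativeMemory:
    """Toy model: per-thread write buffers in front of a permanent store."""

    def __init__(self, initial, n_threads):
        self.l2 = dict(initial)                             # permanent state
        self.buffers = [dict() for _ in range(n_threads)]   # speculative writes
        self.read_sets = [set() for _ in range(n_threads)]  # stand-in for read bits
        self.violated = [False] * n_threads
        self.head = 0                                       # oldest uncommitted thread

    def read(self, t, addr):
        # Renaming: look in this thread's buffer, then in earlier threads'
        # buffers (newest first), then in permanent state. Writes made by
        # later threads are never visible here.
        for src in range(t, self.head - 1, -1):
            if addr in self.buffers[src]:
                value = self.buffers[src][addr]
                break
        else:
            value = self.l2[addr]
        if addr not in self.buffers[t]:
            self.read_sets[t].add(addr)   # value came from outside this thread
        return value

    def write(self, t, addr, value):
        self.buffers[t][addr] = value
        # Any later thread that already read this address read it too early.
        for later in range(t + 1, len(self.buffers)):
            if addr in self.read_sets[later]:
                self.violated[later] = True

    def restart(self, t):
        # Discard speculative state of thread t and (conservatively) of all
        # later threads; permanent state is untouched.
        for u in range(t, len(self.buffers)):
            self.buffers[u].clear()
            self.read_sets[u].clear()
            self.violated[u] = False

    def commit(self, t):
        # Buffers drain into permanent state strictly in program order.
        assert t == self.head and not self.violated[t]
        self.l2.update(self.buffers[t])
        self.buffers[t].clear()
        self.read_sets[t].clear()
        self.head += 1


mem = SpeculativeMemory({"x": 1, "y": 0}, n_threads=2)
early = mem.read(1, "x")                 # thread 1 reads x too early
mem.write(1, "y", early + 1)
mem.write(0, "x", 10)                    # earlier thread writes x
was_violated = mem.violated[1]           # the early read is detected
mem.restart(1)                           # discard thread 1's speculative state
mem.write(1, "y", mem.read(1, "x") + 1)  # re-execution sees the forwarded 10
mem.commit(0)
mem.commit(1)
result = dict(mem.l2)
```

After both commits the permanent state matches what sequential execution of the two threads would have produced.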

Based on the amount of memory and logic
required, we estimate that the cost of adding
speculation hardware is comparable to adding
an additional pair of primary caches to the
system. This enlarges the Hydra die only by a
few percent.












Table 3. A summary of the speculatively parallelized applications used to make performance measurements
with Hydra. Applications in italics were also hand-parallelized and run on the base Hydra design.

Application  Source            Description                                                  How parallelized
compress     SPEC95            Entropy-encoding compression of a file                       Speculation on loop for processing each input character
eqntott      SPEC92            Logic minimization                                           Subroutine speculation on core quick sort routine
grep         Unix command      Finds matches to a regular expression in a file              Speculation on loop for processing each input line
m88ksim      SPEC95            CPU simulation of Motorola 88000                             Speculation on loop for processing each instruction
wc           Unix command      Counts the number of characters, words, and lines in a file  Speculation on loop for processing each character
ijpeg        SPEC95            Compression of an RGB image to a JPEG file                   Speculation on several different loops used to process the image
mpeg2        Mediabench suite  Decompression of an MPEG-2 bitstream                         Speculation on loop for processing slices
alvin        SPEC92            Neural network training                                      Speculation on 4 key loops
cholesky     Numeric recipes   Cholesky decomposition and substitution                      Speculation on main decomposition and substitution loops
ear          SPEC92            Inner ear modeling                                           Speculation on outer loop of model
simplex      Numeric recipes   Linear algebra kernels                                       Speculation on several small loops

We extended our model of the Hydra system
with speculation support to verify our
thread-level speculation mechanisms. On all
applications except eqntott, which was
parallelized using subroutine speculation, we
used our source-to-source loop-translating
system to convert loops in the original
programs to their speculative forms. Even with
our simple, early programming environment,
we could parallelize programs just by picking
out which loops and/or subroutines we wanted
speculatively parallelized. We then let the
tools do the rest of the work for us.

Because this design environment is C-based,
and because Fortran programs can often be
automatically parallelized using compilers such as
SUIF, we limited our set of applications to a
wide variety of integer and floating-point C
programs. Table 3 lists them. Many of these
programs are difficult or impossible to
parallelize using conventional means due to the
presence of frequent true dependencies.
However, all of the more highly parallel applications
listed in Table 3 can be parallelized by hand.
Still, automatically parallelizing compilers are
stymied by the presence of many C pointers
in the original source code that they cannot
disambiguate statically.
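
The difficulty with pointers can be shown with a small example. The sketch below is illustrative only: it models a loop whose body is `a[dst[i]] += a[src[i]]`, with the index arrays standing in for C pointers, and the helper names `accesses` and `cross_iteration_raw` are hypothetical. Whether the iterations are independent depends entirely on the run-time contents of `src` and `dst`, which a compiler cannot see.

```python
def accesses(src, dst):
    """Per-iteration read and write sets for the body a[dst[i]] += a[src[i]]."""
    reads = [{s, d} for s, d in zip(src, dst)]
    writes = [{d} for d in dst]
    return reads, writes


def cross_iteration_raw(reads, writes):
    """Return (writer_iter, reader_iter, location) for every true dependency,
    i.e. a later iteration reading a location an earlier iteration wrote."""
    last_writer = {}
    deps = []
    for i, (r_locs, w_locs) in enumerate(zip(reads, writes)):
        for loc in sorted(r_locs):
            if loc in last_writer:
                deps.append((last_writer[loc], i, loc))
        for loc in w_locs:
            last_writer[loc] = i
    return deps


# Same loop body, two different inputs.
independent = cross_iteration_raw(*accesses([0, 1, 2, 3], [4, 5, 6, 7]))
chained = cross_iteration_raw(*accesses([0, 4, 5, 6], [4, 5, 6, 7]))
```

With the first input no iteration depends on another, so the loop could run fully in parallel. With the second, every iteration consumes the value produced by its predecessor. A static compiler must assume the worst case, whereas speculation only pays a penalty when a dependency actually occurs at run time.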

Figure 6 summarizes our results. After our
initial speculative runs with unmodified loops
from the original programs, we used feedback
from our first simulations to optimize our
benchmarks by hand. This avoided the most
critical violations that caused large amounts
of work to be discarded during restarts. These
optimizations were usually minor, typically
just moving a line of code or two or adding
one synchronization point. However, they
had a dramatic impact on benchmarks such
as MPEG2.

Figure 6. Speedup of speculatively parallelized applications running on Hydra
compared with the original uniprocessor code running on one of Hydra's four
processors. The gray areas show the improved performance following tuning
with feedback-based code.
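
A toy timing model suggests why moving a single line of code can matter so much. This is only a sketch, not the simulator used for the results: it assumes every iteration takes the same time, that each iteration reads one shared variable at offset `read_at` and writes it at offset `write_at`, and that a violated thread restarts at the moment the earlier thread's write occurs. The function name and parameters are hypothetical.

```python
def speculative_finish_time(n_iters, iter_time, read_at, write_at,
                            n_procs=4, restart=True):
    """Completion time of n_iters loop iterations run as speculative threads."""
    start, finish = [], []
    for i in range(n_iters):
        # A processor becomes free when the thread n_procs iterations back ends.
        begin = finish[i - n_procs] if i >= n_procs else 0
        if i > 0:
            write_time = start[i - 1] + write_at
            if begin + read_at < write_time:      # the read would be too early
                if restart:
                    begin = write_time            # earlier work is discarded
                else:
                    begin = write_time - read_at  # stall at a synchronization point
        start.append(begin)
        done = begin + iter_time
        if finish:
            done = max(done, finish[-1])          # threads complete in order
        finish.append(done)
    return finish[-1]


sequential = 8 * 100
late_write = speculative_finish_time(8, 100, read_at=10, write_at=90)
early_write = speculative_finish_time(8, 100, read_at=10, write_at=20)
with_sync = speculative_finish_time(8, 100, read_at=10, write_at=90,
                                    restart=False)
```

In this model, eight iterations of 100 time units finish at time 730 when the write comes late in the loop body, barely better than the 800 units of sequential execution. Moving the write near the top of the body brings the finish time down to 260. Replacing restarts with a synchronization point while leaving the write late gives 660.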

Overall, these results are at least comparable
to and sometimes better than those of a single
large uniprocessor of similar area running these
applications, based on our past work simulating
CMPs and uniprocessors.

Of course, a CMP can also perform faster
by running fully parallelized programs without
speculation, when those programs are
available. A uniprocessor cannot. It is even
possible to mix and match using
multiprogramming. For example, two processors could
be working together on a speculative application,
while others work on a pair of completely
different jobs. While we have not fully
implemented it in our prototype, we could
relatively easily enhance the speculative support
routines so that multiple speculative tasks
could run simultaneously. Two processors
would run one speculative program, and two
would run a completely different speculative
program. In this manner, it is possible for a
CMP to nearly always outperform a large
uniprocessor of comparable area.

Speedups are only a part of the story,
however. Speculation also makes parallelization
much easier, because a parallelized program
that is guaranteed to work exactly like the
uniprocessor version can be generated
automatically. As a result, programmers only need
to worry about choosing which program sections
should be speculatively parallelized and
tweaked for performance optimization. Even
when optimization is required, we found that
speculative parallelization typically took a single
programmer a day or two per application.
In contrast, hand parallelization of these C
benchmarks typically took one programmer
anywhere from a week to a month, since it was
necessary to worry about correctness and
performance throughout the process. As a result,
even though adding speculative hardware to
Hydra will make the chip somewhat harder
to design and verify, the reduced cost of
generating parallel code will offer significant
advantages.


Prototype implementation

To validate our simulations, develop more
speculative software, and verify that the Hydra
architecture is as simple to design as we believe
it to be, we are working with IDT to
manufacture a prototype Hydra. It will use IDT's
embedded MIPS-based RC32364 core and
SRAM macrocells. We have a Verilog model
of the chip and are moving it into a physical
design using synthesis. With 8-Kbyte primary
instruction and data caches and approximately
128 Kbytes of on-chip secondary
cache, the die (depicted in Figure 7) will be
about 90 mm² in IDT's 0.25-micron process.
We based these area and layout estimates on
the current RC32364 layout and area estimates
of new components obtained using our
Verilog models of the different sections of the
Hydra memory system.

The memory system we are designing to
connect the IDT components together
consists of the following:

• our speculative coprocessor,

• interconnection buses,

• controllers for all memory resources,

• speculative buffers and bits,

• a simple off-chip main memory
controller, and

• an I/O and debugging interface that we
can drive using a host workstation.

We are designing most of this using a fairly
straightforward standard cell methodology.
The clock rate target for the cores is about 250
MHz, and we plan on inserting pipeline stages
into our memory system logic as necessary to
avoid slowing the cores. The most critical part
of the circuit design will be in the central
arbitration mechanism for the memory
controllers. This circuit is difficult to pipeline,
must accept many requests for arbitration
every cycle, and must respond to each request
with a grant signal. The large numbers of high
fan-in and fan-out gates that must operate
during every cycle make it a challenging
circuit design problem.
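
The request and grant behavior of an arbiter can be sketched in a few lines. The article does not state which arbitration policy Hydra uses, so the round-robin priority below is an assumption chosen for illustration, the function name `arbitrate` is hypothetical, and only a single shared resource is modeled.

```python
def arbitrate(requests, last_granted):
    """One cycle of round-robin arbitration for a single shared resource.

    requests     -- list of booleans, one request line per requester
    last_granted -- index that won the previous cycle
    Returns (grants, winner): one grant line per requester, and the index
    to pass back in as last_granted on the next cycle.
    """
    n = len(requests)
    for offset in range(1, n + 1):
        idx = (last_granted + offset) % n   # search starts after the last winner
        if requests[idx]:
            grants = [False] * n
            grants[idx] = True
            return grants, idx
    return [False] * n, last_granted        # nobody requested this cycle


# Requesters 0, 2, and 3 keep requesting; the grant rotates among them.
reqs = [True, False, True, True]
winners = []
last = 0
for _ in range(4):
    grants, last = arbitrate(reqs, last)
    winners.append(last)
```

In hardware, the loop over requesters becomes combinational logic that examines every request line in the same cycle, which is where the high fan-in and fan-out come from.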

Figure 7. The floor plan of the Hydra implementation.

A chip multiprocessor such as Hydra will
be a high-performance, economical alternative
to large single-chip uniprocessors. A
CMP of comparable die area can achieve
performance similar to a uniprocessor on integer
programs using thread-level speculation
mechanisms. In addition, with multiprogrammed
workloads or highly parallel applications
a CMP can significantly outperform a
uniprocessor of comparable cost by operating
as a multiprocessor. Furthermore, the hardware
required to support thread-level speculation
is not particularly area-intensive.
Inclusion of this feature is not expensive, even
though it can significantly increase the number
of programs that can be easily parallelized
to fully use the CMP.
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